\chapter{Data representation}
\label{chap-data-representation}

\section{Introduction}

In this chapter, we elaborate on the representation of essential data
types in \sysname{}.  Some such data types are defined in extracted
systems, usually as standard objects.  Those data types are:

\begin{itemize}
\item streams (\refSec{sec-streams}),
\item hash tables (\refSec{sec-hash-tables}),
\item packages (\refSec{sec-packages}),
\item arrays (\refSec{sec-array}), and
\item structures (\refSec{sec-structures})
\end{itemize}

\section{Low-level tag bits}

The three least significant bits of a machine word contain a
\emph{major tag} that is used to represent four different object types
as follows:

\begin{itemize}
\item \texttt{000}, \texttt{010}, \texttt{100}, \texttt{110}.  These
  tags are used for fixnums.  The bits except the last one represent
  an integer in two's complement representation.  On a 64-bit machine,
  a fixnum is thus in the interval $[-2^{-62}, 2^{62} - 1]$.
\item \texttt{001}.  This tag is used for \texttt{cons} cells.  A
  pointer to a \texttt{cons} cell is thus a pointer (aligned to a
  double word) to which the machine integer $1$ has been added.  See
  \refSec{sec-data-representation-cons-cells} for more information
  about the representation of \texttt{cons} cells.
\item \texttt{011}.  This tag is used for various \emph{immediate}
  objects, and in particular for \emph{characters}.
  \seesec{sec-data-representation-immediate-objects}.
\item \texttt{101}.  This tag is used for \emph{standard objects},
  which are all heap-allocated objects other than \texttt{cons} cells.
  Such an object is represented by a two-word header with one
  word containing a tagged pointer to the \emph{class object} and the
  other word containing a tagged pointer to the \emph{rack}.  See
  \refSec{sec-data-representation-standard-object} for more
  information about the representation of standard objects.
\item \texttt{111}.  This tag is used to tag a pointer to a
  \emph{rack}.  Notice that a rack is not a first-class \commonlisp{}
  object, so it can not be the value of any variable, slot, or
  argument.
\end{itemize}

On a 64-bit machine, floats of type \texttt{single-float} are
represented as immediate values.  We currently do not plan a separate
type for short floats.

\section{Immediate objects}
\label{sec-data-representation-immediate-objects}

Immediate objects are all objects with \texttt{011} in the lower three
bits.  Two more bits are used for a \emph{minor tag} to distinguish
between different kinds of immediate objects, giving the following
five low bits:

\begin{itemize}
\item \texttt{00011}.  This tag is used for Unicode characters.  When
  shifted five positions to the right, the value gives the Unicode
  code point.
\item \texttt{01011}.  This tag is currently unused.
\item \texttt{10011}.  This tag is used for single floats (64-bit
  platforms only).
\item \texttt{11011}.  This tag is currently unused.
\end{itemize}

\subsection{Characters}

As indicated above, the low five bits of a character have the value
\texttt{00011}, and the corresponding Unicode code point is obtained by
shifting the value of the character five positions to the right.

We currently do not plan to supply a module for Unicode support.
Instead we are relying on the support available in the Unicode library
by Edi Weitz.

\subsection{Single floats}

On a 64-bit platform, a single float corresponds to a single-precision
IEEE floating-point value.  The value is stored in the
most-significant half of the word.

\section{\texttt{cons} cells}
\label{sec-data-representation-cons-cells}

A \texttt{cons} cell is represented as two consecutive machine
words aligned on a double-word boundary.

\section{Standard objects}
\label{sec-data-representation-standard-object}

Recall that a \emph{standard object} is a heap allocated object that
is not a \texttt{cons} cell.  All standard objects are represented in
(at least) two parts, a \emph{header} and a \emph{rack}.  The
header always consists of two consecutive words aligned on a
double-word boundary (just like \texttt{cons} cells).  The first word
always contains a tagged pointer to a \emph{class} object (which is
another standard object).  The second word contains a tagged pointer
(with tag $111$) to the first word of the rack.%
\footnote{But see
  \refApp{sec-alternative-representation-of-standard-objects}
for a possible alternative representation of standard-objects.}

One advantage of representing standard objects this way is that we can
make sure that the rack is \emph{internally consistent}.  To explain
what we mean by this concept, let us take an \emph{adjustable array}
as an example.  The implementation of \texttt{aref} must check that
the indices are valid, compute the offset of the element and then
access the element.  But in the presence of threads, between the index
check and the access, some other thread might have adjusted the array
so that the indices are no longer valid.  In most implementations, to
ensure that \texttt{aref} is \emph{thread safe}, it is necessary to
prevent other threads from intervening between the index check and the
access, for instance by using a \emph{lock}.  In \sysname{}, adjusting
the array involves creating a new rack with new dimensions, and then
with a single store instruction associate the new rack with the array.
The implementation of \texttt{aref} would then initially obtain a
pointer to the rack and then do the index check, the computation of
the offset, and the access without risking any interference.  No
locking is therefore required.  Another example is a \emph{generic
  function}.  When a method is added or deleted, or when a new suite
of argument classes is seen, the generic function must be
destructively updated.  Normally, this operation would require some
locking primitive in order to prevent other threads from invoking a
partially updated generic function.  In \sysname{}, to update a
generic function this way, a new rack would be allocated and the
modifications would be made there, leaving the original generic
function intact until the final instruction to store a reference to
the the new rack in the header.

The first entry of the rack of every standard object is a small fixnum
called the \emph{stamp} of the standard object.  The stamp is the
\emph{unique class number} of the class of the object, as it was when
the object was created or updated after having been obsolete.  The
main purpose of this information is to be used in \emph{generic
  function dispatch}.  It is also used to determine whether a standard
object is obsolete.  A standard object is obsolete when the stamp of
the object is not the same as the \emph{current} unique class number
of the class of the object.  An object becomes obsolete as a result
of changes to its class.  And a class is changed as a result of
passing it to \texttt{reinitialize-instance}.  Some classes can not be
reinitialized, in which case its instances can not become obsolete.

An obsolete object must be updated before its contents can be
accessed.  The standard specifically allows for these updates to be
delayed and not happen as a direct result of the class redefinition.
They must happen before an attempt is made to access some slot (or
some other information about the slots) of the instance.  It is
undesirable to make all instances directly accessible from the class,
because such a solution would waste space and would have to make sure
that memory leaks are avoided.  We must thus take into account the
presence of obsolete objects in the system, and update them if an
when an attempt is made to access their contents.

The solution is to store some kind of \emph{version} information in
the rack so that when an attempt is made to access an
obsolete instance, the instance can first be updated to correspond to
the current definition of its class.  This version information must
allow the system to determine whether any slots have been added or
removed since the instance was created.  Furthermore, if the garbage
collector traces an obsolete instance, then it must either first
update it, or the version information must allow the garbage collector
to trace the obsolete version of the instance.  Our solution allows
both.  We simply store a reference to the \emph{list of effective
  slots} that the class of the instance defined when the instance was
created.  This reference is stored as the \emph{second} word of the
rack (recall that the first word is taken up by the
\emph{stamp}).

This solution makes it possible to determine the layout of the rack of
an obsolete instance, so that it can be traced by the garbage
collector when necessary.  This solution also allows the system to
determine which slots have been added and which slots have been
removed since the instance was created.  In order to detect whether an
object is obsolete, the contents of the first word of the rack
(i.e. the \emph{stamp}) is compared to the \emph{class unique number}
of the class of the object.  However, this test is performed
automatically in most cases, because when an obsolete object is passed
as an argument to a generic function, the \emph{automaton} of the
discriminating function of the generic function will fail to find an
effective method, triggering an update of the object.
\seesec{sec-generic-function-dispatch-the-discriminating-function}.

Together, we sometimes refer to these words as the \emph{prefix} of
the rack.

\section{Funcallable standard objects}
\label{sec-data-representation-funcallable-standard-objects}

By definition, a \emph{funcallable standard object} is an instance of
a subclass of the class \texttt{funcallable-standard-object} which is
itself a subclass of the class \texttt{standard-object} and of the
class \texttt{function}. \seesec{sec-data-representation-functions}.

To make function invocation fast, we want every subclass of the class
\texttt{function} to be invoked in the same way, i.e. by loading the
\emph{static environment} into a register and then transferring
control to the \emph{entry point} of the function. The static
environment and the entry point are stored in the function object
and are loaded into registers as part of the function-call protocol.

When the funcallable standard object is a generic function, invoking
it amounts to transferring control to the \emph{discriminating
  function}.  However, the discriminating function can not \emph{be}
the generic function, because the \clos{} specification requires that
the discriminating function of a generic function can be replaced,
without changing the identity of the generic function itself.
Furthermore, the discriminating function does not have to be stored in
a slot of the generic function, because once it is computed and
installed, it is no longer needed.  In order for the generic function
itself to behave in exactly the same way as its discriminating
function, whenever a new discriminating function is \emph{installed},
the \emph{entry point} and the \emph{static environment} are copied
from the discriminating function to the corresponding slots of the
generic function itself.

Every \sysname{} function is a funcallable standard object.  If it is
an ordinary (i.e., not generic) function, it is an instance of the
class \texttt{simple-function} which is a subclass of
\texttt{funcallable-standard-object}.

\section{Call-site descriptor}
\label{sec-call-site-descriptor}

A \emph{call-site descriptor} is an object that is uniquely associated
with a \emph{call site}, i.e., an instruction that invokes a function.
The function being invoked can be a named global function, a lexical
function, or an anonymous function.%
\footnote{Though calls to lexical functions may be optimized, and
  they may therefore not respect the ordinary function-call protocol.
  In that case, it is possible that call-site information is not
  available.}
The call-site descriptor is
passed as an implicit argument to the callee.  The callee stores the
call-site descriptor in a fixed location relative to the frame pointer
in its stack frame, so that it can be found by the garbage collector,
the debugger, the backtrace inspector, etc.

The call-site descriptor contains the following information about the
caller containing the call site:

\begin{itemize}
\item The \emph{name} of the callee when the call site represents a
  named call.  This information is used by the backtrace inspector to
  display useful information to the programmer.  When the callee is a
  global function, this information is also used by the call-site
  manager to associate this call site with a function name in the
  global environment at the time when the caller is tied to that
  environment.
\item An index into the code vector where the unconditional
  \emph{jump} instruction that represents the call is located.
\item Source location information for the call site.
\item A \emph{trace map}, which is a bitmap indicating stack locations
  that may contain objects that the garbage collector must trace.  The
  compiler may omit stack locations that are known to contain
  \commonlisp{} objects that the garbage collector does not have to
  trace.  For a typical backend, index $i$ of the bitmap represents
  the stack location at address $b-i$ where $b$ is the value of the
  base pointer.
\item When the call represents a named call to a global function, the
  location of each argument to the call.  This information is used by
  the call-site manager to generate code for accessing each argument
  so that it can be passed to the callee, if required.%
  \footnote{The argument may not be required by the callee, for
    example if it is a keyword of a keyword/value pair.  In that case,
    it is possible that the callee needs only the value and not the
    keyword.}
  This list is an association list where each element is a pair
  \texttt{(\emph{type} . \emph{value})}.
  There can be three types of locations:
  \begin{enumerate}
  \item A stack location.  Then the element type is the keyword symbol
    \texttt{:stack}, and the value is an offset from the frame
    pointer.
  \item A register location. Then the element type is the keyword
    symbol \texttt{:register} and the value is the name of a
    register.
  \item A literal object.  Then the element type is the keyword symbol
    \texttt{:literal} and the value is the object itself.
  \end{enumerate}
\item For each argument representing a lexical location that is
  \emph{live} after the call, its stack location.  The backtrace
  inspector uses this information to display the arguments to a
  particular call.  But arguments that represent lexical locations
  that are not live after the call, may have been reclaimed by the
  garbage collector, so they can not be displayed.
\item For each live lexical location corresponding to a variable that
  is explicitly mentioned in source code:
  \begin{itemize}
  \item its location in the stack frame as an offset from the frame
    pointer, and
  \item information about the \emph{type} of the object that the
    lexical location can contain at this program point, i.e.,
    assumptions about the type that were exploited by the compiler in
    order to generate subsequent instructions.  The purpose of this
    information is to restrict assignments to this location from the
    debugger, so as to maintain the semantics of the code.
  \end{itemize}
\end{itemize}

The bitmap meant for the garbage collector is present only in
call-site descriptors. At every \emph{safe point} the thread tests the
flag \texttt{gc-requested} described in
\refChap{chap-garbage-collector}.  If that flag is set, a function
call is made to the local garbage collector.  Therefore, every safe
point corresponds to a function call, and thus the information is
needed by the garbage collector only at function calls.

Similarly, a debugger breakpoint corresponds to a function call to the
debugger.  Therefore, information about live variables need to be
available only at call sites.  In this case, the source location
information of the call-site descriptor is a location immediately
preceding a form that is about to be evaluated, or immediately
following a form that has just been evaluated.

\section{Function descriptor}
\label{data-representation-function-descriptor}

A \emph{function descriptor} is a standard object that contains
informormation about all functions that share the same code (but that
may have different static environments).  It contains the following
information:

\begin{itemize}
\item A reference to the \emph{code object}
  \seesec{data-representation-code-objects} that represents the
  compilation unit in which the function code is located.
\item The \emph{source information} of the function, in the form of 
  a start position and an end position of the source code of the
  function in the compilation unit.  The text itself is contained in
  the code object.
\item The lambda list of the function.
\item Information about one or more \emph{entry points} of the
  function, each one specifying an address of the entry point and the
  types of the arguments, and the location (a register or a stack
  location) where the function expects each argument.
\item A fixnum indicating the \emph{frame size} of the function.  This
  is the size of the stack frame that the function needs in order to
  perform its task. 
\end{itemize}

A reference to the function descriptor is located in the aligned word
immediately preceding the argument-parsing entry point in the code
vector.

\section{Code objects}
\label{data-representation-code-objects}

A \emph{code object} is a standard object that represents the
\emph{code} of a compilation unit.  The code of every function in the
compilation unit is present in a single code object.

A code object contains the following information:

\begin{itemize}
\item The executable native instructions in the form of a vector of
  unsigned bytes.
\item A vector of strings representing the source code of the
  compilation unit represented by this code object.  Each string
  corresponds to a line of code.
\item Information about the file that contains the source code of the
  compilation that the code object represents.
\item A list of \emph{call-site descriptors}
  \seesec{sec-call-site-descriptor}, one for each call site in the
  compilation unit.
\item A vector of \emph{literal objects}, including objects produced
  by \texttt{load-time-value} and then treated as literal objects at
  run time.  This vector is used only by the garbage collector.
  Literal objects are allocated in the global heap, so they do not
  move.  This feature allows us to refer to literal objects as
  constants in native code, but the garbage collector can not find
  such references in native code, hence this vector.
\end{itemize}

\section{Instances of built-in classes}

The only direct instances of built-in classes are fixnums, characters,
single floats, and \texttt{cons} cells.

\subsection{Instances of \texttt{sequence}}

The system class \texttt{sequence} can not be directly instantiated.
Instead, it serves as a superclass for the classes \texttt{list} and
\texttt{vector}.

The standard is a bit contradictory here, because
in some places it says that \texttt{list} and \texttt{vector}
represent an exhaustive partition of \texttt{sequence}%
\footnote{See for instance section 17.1}
but in other places it explicitly allows for other subtypes of
\texttt{sequence}.%
\footnote{See the definition of the system class \texttt{sequence}.}
The general consensus seems to be that other subtypes are allowed.

\section{Symbols}

A symbol is a standard object, and \texttt{symbol} is a standard
class.  A symbol has slots containing the following data:

\begin{enumerate}
\item The \emph{name} of the symbol.  The value of this slot is a
  string.
\item The \emph{package} of the symbol.  The value of this slot is a
  package or \texttt{NIL} if this symbol does not have a package.
\end{enumerate}

Notice that the symbol does not contain its \emph{value} as a global
variable, nor does it contain its definition as a \emph{function} in
the global environment.  Instead, this information is contained in an
explicit \emph{global environment} object.

Notice also that the symbol does not contain the \emph{property list}
associated with it.  This information is also kept separately in an
explicit \emph{global environment} object.

\section{Functions}
\label{sec-data-representation-functions}

\subsection{Function object}

Ordinary (non-generic) \sysname{} functions are instances of the class
named \texttt{simple\-function}.  The class \texttt{simple-function}
is a direct subclass of \texttt{funcallable\-standard-object}.  These
two symbols both have the package \texttt{sicl-clos} as their home
package.

In order to obtain reasonable performance, we represent functions in a
somewhat complex way, as illustrated by
\refFig{fig-function-representation}.

\begin{figure}
\begin{center}
\inputfig{fig-function-representation.pdf_t}
\end{center}
\caption{\label{fig-function-representation}
Representation of functions.}
\end{figure}

\refFig{fig-function-representation} shows two functions.  The two
functions were created from the same compilation unit, because they
share the same code object. \seesec{data-representation-code-objects}.

A function is represented as a two-word header (as usual) and a
rack with the following contents:

\begin{enumerate}
\item The obligatory \emph{stamp}.
\item The obligatory copy of the list of effective slots of the class
  \texttt{simple-function}.
\item The \emph{static environment}.  This is a vector containing
  closed-over variables and some other information.
\item The \emph{entry point}.  The entry
  point is a \emph{raw address} of an aligned word in the vector
  containing the instructions of the function.  Functions are always
  allocated in the global heap, so the code vector never moves.
  Therefore, this address will never need to be updated.  The word
  immediately preceding the entry point in the code vector contains a
  reference to the code object.  The garbage collector uses the entry
  point of a function to find the code object, thereby keeping the
  code object alive as long as there is at least one live function in
  the code object.
\item A fixnum indicating the \emph{frame size} of the function.
  While this information really belongs in the function descriptor as
  described in \refSec{data-representation-function-descriptor}, we
  duplicate it here for reasons of performance, because it is needed
  by the default calling convention (as described in
  \refChap{chap-calling-conventions}).
\end{enumerate}

The static environment contains the \emph{code object} as one of its
elements.

Since raw addresses are word aligned, they show up as \texttt{fixnum}s
when inspected by tools that are unaware of their special
signification.

In addition to these slots, a generic function also contains other
slots holding the list of its methods, and other information.

\subsection{Code structure}
\label{sec-function-code-structure}

The general code structure of a function is illustrated in
\refFig{fig-function-code-structure}.  The code in the \emph{general
  prelude} parses arguments into the lexical variable locations
required by the \emph{main function body}.

\begin{figure}
\begin{center}
\inputfig{fig-function-code-structure.pdf_t}
\end{center}
\caption{\label{fig-function-code-structure}
General code structure of a function.}
\end{figure}

The main function body is compiled so that a maximum of debug
information is present and so that lexical variables are preserved for
inspection from the debugger.  The main function body also contains
code for interacting with the debugger.  At each potential breakpoint,
there is a code snippet that tests whether a breakpoint might be set
there, and if so, communicates with the debugger to determine what the
action should be.  Furthermore, the main function body is compiled so
that each lexical variable has a fixed location in the stack frame,
and this location is where the lexical variable is located when the
application interacts with the debugger.  Therefore, the mapping from
lexical variables to stack locations is a property of the function
object and it is accessible to the debugger.  Furthermore, the
function object contains a mapping from values of the program counter
to bitmaps where a bitmap indicates which lexical variables are in
scope and live at that value of the program counter.  This information
is also included in the function object and it is accessible to the
debugger.  Thus, all the application needs to communicate to the
debugger at a breakpoint is the value of the program counter.  In
addition, the code of the main function body is compiled without
propagation of type information for lexical variables.  Whenever the
function accesses a lexical variable, if needed for the correct
execution of the function, the type of the value of the variable is
checked.  Therefore, the debugger is free to assign any new value to a
lexical variable.  If it is not the right type, this situation will be
detected and an error is signaled.

The example in \refFig{fig-function-code-structure} indicates that
there might be several places where the function returns, each place
having different return values.

The general structure shown in \refFig{fig-function-code-structure} is
the only one we plan to implement in the initial version of
\sysname{}.  In subsequent versions of \sysname{}, we take advantage
of the existence of the \emph{call-site manager} to optimize function
code and function calls, while preserving the possibility of good
debugging.  To accomplish this optimization, we include an additional
\emph{optimized function body} in each function as shown in 
\refFig{fig-function-code-structure-2}

\begin{figure}
\begin{center}
\inputfig{fig-function-code-structure-2.pdf_t}
\end{center}
\caption{\label{fig-function-code-structure-2}
Code structure of a function with optimized body.}
\end{figure}

The optimized function body is not suitable for debugging.  Some
lexical variable may have been eliminated, and there are no code
snippets for breakpoints.  The call-site manager is responsible for
parsing arguments, which is why the optimized function body is not
preceded by any prelude.  Instead, a call to this function goes
directly to the optimized function body.

\section{Conditions}

Conditions are standard objects.  The class \texttt{condition} is a
subclass of \texttt{standard-object}.  The class \texttt{condition} is
an instance of the class \texttt{sicl-clos:condition-class}.
We are using the library \predicament{} to implment conditiions, as
described in \refSec{sec-condition-system}.


% LocalWords:  callee
